
BMC Bioinformatics

Springer Science and Business Media LLC

Preprints posted in the last 90 days, ranked by how well they match the content profile of BMC Bioinformatics, based on 383 papers previously published there. The average preprint has a 0.37% match score for this journal, so anything above that is already an above-average fit.

1
dna-parser: a Python library written in Rust for fast encoding of DNA and RNA sequences

Vilain, M.; Aris-Brosou, S.

2026-01-21 bioinformatics 10.64898/2026.01.20.700656 medRxiv
Top 0.1%
53.4%

Background: The ever-growing amount of available biological data means that modern analyses are performed on large datasets. Unfortunately, bioinformatics tools for preprocessing and analyzing data are not always designed to handle such large amounts of data efficiently. Notably, this is the case when encoding DNA and RNA sequences into numerical representations, also called descriptors, before passing them to machine learning models. Furthermore, the Python tools currently available for this preprocessing step are not well suited to integration into pipelines, resulting in slow encoding speeds. Results: We introduce dna-parser, a Python library written in Rust that encodes DNA and RNA sequences into numerical features. The combination of Rust and Python allows sequences to be encoded rapidly and in parallel across multiple threads while maintaining compatibility with packages from the Python ecosystem. Moreover, the library implements many of the most widely used numerical feature schemes from bioinformatics and natural language processing. Conclusion: dna-parser is an easy-to-install Python library that offers Python wheels for Linux (musllinux and manylinux), macOS, and Windows via pip (https://pypi.org/project/dna-parser/). The open-source code is available on GitHub (https://github.com/Mvila035/dna_parser) along with the documentation (https://mvila035.github.io/dna_parser/documentation/).
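For readers unfamiliar with what "encoding into numerical descriptors" means here, a minimal pure-Python one-hot encoder gives the flavor. This is an illustrative sketch only, not the dna-parser API (see the project documentation above for the real interface); the function and names are ours.

```python
# Minimal illustration of DNA one-hot encoding, one common descriptor type.
# NOT the dna-parser API; names and behavior here are purely illustrative.

BASES = "ACGT"

def one_hot(seq: str) -> list[list[int]]:
    """Encode a DNA sequence as a list of 4-dim indicator vectors.

    Unknown characters (e.g. 'N') map to the all-zero vector.
    """
    index = {b: i for i, b in enumerate(BASES)}
    out = []
    for ch in seq.upper():
        vec = [0, 0, 0, 0]
        if ch in index:
            vec[index[ch]] = 1
        out.append(vec)
    return out

encoded = one_hot("ACGN")
```

A library like dna-parser does this in compiled Rust across many threads, which is where the speedup over pure Python comes from.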

2
thematicGO: A Keyword-Based Framework for Interpreting Gene Ontology Enrichment via Biological Themes

Wang, Z.; Sudlow, L. C.; Du, J.; Berezin, M. Y.

2026-02-10 bioinformatics 10.64898/2026.02.08.704666 medRxiv
Top 0.1%
37.9%

Background: Gene Ontology (GO) enrichment analysis is a widely used approach for interpreting high-throughput transcriptomic and genomic data. However, conventional GO over-representation analyses typically yield long, redundant lists of enriched terms that are difficult to apply to biological problems and to use for identifying the most relevant pathways. Results: We present thematicGO, a customizable framework that organizes enriched GO terms into biological themes using a curated keyword-based matching strategy. In this approach, GO enrichment of differentially expressed genes is performed through the g:Profiler Application Programming Interface (API), followed by aggregation of scores within each theme from the contributing GO terms. Side-by-side comparison with conventional GO annotation workflows demonstrates that thematicGO captures related biological outcomes while substantially reducing redundancy and improving readability. To enhance accessibility, we implemented an interactive, web-deployed graphical user interface (GUI) that enables users to upload gene lists and explore thematic enrichment results. Conclusion: thematicGO simplifies functional enrichment analysis by bridging the gap between granular GO term outputs and higher-level biological interpretation through its theme concept, which can be especially useful for RNA-seq studies that identify differentially expressed genes. The approach complements standard GO enrichment with transparent, theme-based aggregation and direct comparison against classical annotation workflows. thematicGO provides an easy, understandable, and reproducible tool for transcriptomic studies, particularly those involving RNA-seq data and complex biological responses.
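The core idea of keyword-based theme aggregation can be sketched in a few lines. The themes, keywords, example terms, and the -log10(p) scoring rule below are our own illustration, not thematicGO's actual configuration:

```python
# Illustrative sketch of theme aggregation: collapse enriched GO terms into
# keyword-defined themes and sum a significance score per theme.
import math

# Hypothetical themes and keywords (not thematicGO's curated set).
themes = {
    "inflammation": ["inflammatory", "cytokine", "interleukin"],
    "cell cycle": ["mitotic", "cell cycle", "chromosome segregation"],
}

# (GO term name, adjusted p-value) pairs, as an enrichment tool might return.
enriched = [
    ("inflammatory response", 1e-8),
    ("cytokine-mediated signaling pathway", 1e-5),
    ("mitotic cell cycle", 1e-4),
]

def theme_scores(terms, themes):
    """Sum -log10(p) over terms whose name matches any theme keyword."""
    scores = {name: 0.0 for name in themes}
    for term, p in terms:
        for name, keywords in themes.items():
            if any(kw in term.lower() for kw in keywords):
                scores[name] += -math.log10(p)
    return scores

scores = theme_scores(enriched, themes)
```

Instead of three separate terms, the reader sees two themes with aggregate scores, which is the redundancy reduction the abstract describes.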

3
An explainable boosting machine model for identifying artifacts caused by formalin-fixed paraffin embedding

Grether, V.; Goldstein, Z. R.; Shelton, J. M.; Chu, T. R.; Hooper, W. F.; Geiger, H.; Corvelo, A.; Martini, R.; Davis, M. B.; Robine, N.; Liao, W.

2026-03-13 bioinformatics 10.64898/2026.03.10.710815 medRxiv
Top 0.1%
32.6%

Background: Formalin-fixed paraffin embedding (FFPE) is a widely used, cost-effective method for long-term storage of clinical samples. However, fixation is known to damage nucleic acids, which can present as artifactual bases in sequencing that are absent from higher-fidelity storage methods such as fresh freezing (FF). Various machine learning methods exist for filtering these variant artifacts, but benchmarking their performance is difficult without reliable truth sets. In this study, we employ a collection of 90 paired fresh-frozen and formalin-fixed paraffin-embedded samples from the same tumors to robustly define real and FFPE-derived artifactual variation and to enable objective evaluation of filtering methods. To address existing shortcomings, we propose a novel explainable boosting machine (EBM) model that improves performance, can be easily updated with new data, requires modest computational resources, and is analysis-pipeline agnostic, making it broadly accessible. Results: We evaluated several methods for limiting FFPE-derived variant artifacts using cohorts of B-cell lymphoma samples. We found that capturing the local context around variants is a highly informative, under-utilized feature set not commonly incorporated into existing machine learning methods. Consequently, we developed a novel algorithm, FIFA, for filtering FFPE artifacts, which uses an EBM model, an interpretable decision-tree-based learning algorithm, to address these shortcomings. We used four independent cohorts composed of paired lymphoma and cervical cancer samples and a breast cancer cell line with both FF and FFPE samples to define clearly annotated training and test sets, and demonstrated improved performance over existing methods. Additionally, FIFA filtering increased relevant biological signals in FFPE breast cancer datasets distinct from the training and test sets. Because the EBM framework models features additively, FIFA is computationally efficient and new datasets can be incorporated into existing models dynamically over time. Conclusions: Our FFPE variant artifact filtering tool, FIFA, is a marked improvement over existing methods. It can be easily applied post hoc to supplement existing somatic calling pipelines, training and inference run quickly in most compute environments, and it can be updated online as new training data becomes available. Accordingly, FIFA represents an important advance for retrospective cancer genomics research, further enhancing access to the vast stores of FFPE-archived tumor samples currently in existence.
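The "local context" feature family the abstract highlights can be illustrated simply: FFPE deamination artifacts classically appear as C>T changes, and the trinucleotide context around the variant base is one interpretable feature. The function name and fixed window size below are our own illustration, not FIFA's internals:

```python
# Sketch of local-context feature extraction around a candidate variant,
# the under-utilized feature family the study found informative.
# Window size and naming are illustrative, not FIFA's implementation.

def variant_context(ref_seq: str, pos: int, flank: int = 1) -> str:
    """Return the reference bases from pos-flank to pos+flank (0-based pos)."""
    if pos < flank or pos + flank >= len(ref_seq):
        raise ValueError("variant too close to the sequence edge")
    return ref_seq[pos - flank : pos + flank + 1]

# Trinucleotide context of a candidate C>T call at position 3.
ctx = variant_context("GGACGTT", 3, flank=1)
```

In an EBM, a categorical feature like this context string gets its own additive term, which is what keeps the model interpretable.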

4
geneslator: an R package for comprehensive gene identifier conversion and annotation

Cavallaro, G.; Micale, G.; Privitera, G. F.; Pulvirenti, A.; Forte, S.; Alaimo, S.

2026-04-01 bioinformatics 10.64898/2026.03.30.714723 medRxiv
Top 0.1%
28.9%

Motivation: High-throughput sequencing generates large gene lists, making data interpretation challenging. Accurate gene annotation and reliable conversion between identifiers (e.g., gene symbols, Ensembl Gene IDs, Entrez Gene IDs) are essential for integrating datasets, conducting functional analyses, and enabling cross-species comparisons. Existing tools and databases facilitate annotation but often suffer from inconsistencies, missing mappings, and fragmented workflows, limiting reproducibility and interpretability. Results: To address these limitations, we developed geneslator, an R package that unifies gene identifier conversion, ortholog mapping, and pathway annotation across eight model organisms (Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Danio rerio, Saccharomyces cerevisiae, Caenorhabditis elegans, Arabidopsis thaliana). geneslator provides an up-to-date, precise, and coherent framework that preserves data integrity, enables cross-species analyses, and facilitates robust interpretation of gene function and regulation, outperforming state-of-the-art gene annotation tools. Availability: geneslator is available at https://github.com/knowmics-lab/geneslator. Contact: grete.privitera@unict.it

5
MOAflow: how re-design a pipeline with Nextflow streamlines data analysis

Tartaglia, J.; Giorgioni, M.; Cattivelli, L.; Faccioli, P.

2026-03-30 bioinformatics 10.64898/2026.03.26.713914 medRxiv
Top 0.1%
28.7%

Background: Advances in high-throughput DNA sequencing technologies have dramatically reduced the time and cost required to generate genomic data. As sequencing is no longer a limiting factor, increasing attention must be paid to optimizing the analysis of the large-scale datasets produced. Efficient processing of such data is essential to reduce computational time and operational costs. In this context, workflow management systems (WMSs) have become key instruments for orchestrating complex bioinformatic pipelines, and among them Nextflow has emerged as one of the most widely adopted solutions in bioinformatics. Methods: To improve scalability and computational efficiency, we used Nextflow to re-design an existing pipeline for the analysis of MNase-defined cistrome-Occupancy (MOA-seq) data. The re-engineering process focused on modularizing the workflow and integrating containerization technologies to ensure reproducibility and easier deployment across heterogeneous computing environments. Results: The resulting workflow, named MOAflow, is a modernized, fully containerized pipeline for MOA-seq data analysis. With only Docker and Nextflow required, the pipeline guarantees high portability and reproducibility. The data from the original article were used to benchmark the new pipeline; its outputs closely match those of the original study, with minor variations. Conclusions: MOAflow demonstrates how adopting a robust WMS can substantially enhance the performance and usability of pre-existing bioinformatic pipelines. By leveraging containerization and Nextflow, it ensures consistent results across platforms while minimizing setup complexity. This work highlights the value of modern WMS-driven approaches in meeting current computational demands.

6
1100 Synthetic Benchmark Problems for Dynamic Modeling of Cellular Processes

Neubrand, N.; Rachel, T.; Litwin, T.; Timmer, J.; Kreutz, C.; Hess, M.

2026-03-13 systems biology 10.64898/2026.03.10.710893 medRxiv
Top 0.1%
24.6%

Motivation: Systems biology strives to unravel the complex dynamics of cellular processes, often with the help of ordinary differential equations (ODEs). However, the sparsity of measured data and the strong non-linearity of common ODEs introduce severe numerical problems in typical modeling tasks. This has given rise to many computational algorithms that must be systematically evaluated to ensure optimal method choices. Currently, the number of well-curated models available for such benchmarking efforts is insufficient, as building and calibrating biologically reasonable models from experiments requires years of work. Results: We present a large-scale collection of 1100 synthetic modeling problems, generated from the ODE systems and experimental designs of 22 published modeling problems. This is achieved by extending a recent method for simulating time-course data for randomly generated observation functions to also include realistic measurement patterns across multiple experimental conditions. By analyzing data and model characteristics, optimization performance, and parameter identifiability, we show that the synthetic problems provide a realistic and diverse extension of the existing problem space. Hence, the synthetic collection provides a valuable resource for benchmarking in dynamic modeling. Availability and Implementation: Benchmark problems and algorithms are publicly available at https://github.com/niklasneubrand/1100SyntheticBenchmarksODE and https://zenodo.org/records/14008247.

7
From SNPs to Pathways: A genome-wide benchmark of annotation discrepancies and their impact on protein- and pathway-level inference

Queme, B.; Muruganujan, A.; Ebert, D.; Mushayahama, T.; Gauderman, W. J.; Mi, H.

2026-03-24 bioinformatics 10.64898/2026.03.21.713397 medRxiv
Top 0.1%
23.7%

Background: Accurate single-nucleotide polymorphism (SNP) annotation is central to genomic research, yet widely used tools and gene models often yield divergent results. Prior studies have shown such discrepancies in small datasets, but the extent of genome-wide variation and its impact on downstream pathway analysis remain unclear. Results: We conducted a comprehensive comparison of three commonly used SNP annotation tools, ANNOVAR, SnpEff, and VEP, using both Ensembl and RefSeq gene models to evaluate more than 40 million SNPs from the Haplotype Reference Consortium. At the protein level, annotation output differed significantly across tools and gene models (adjusted p < 0.001), with discrepancies present in both genic and intergenic regions. RefSeq produced broader annotation coverage, particularly for intergenic SNPs, while Ensembl showed greater internal consistency. SnpEff provided the most complete coverage overall, whereas no single tool or model configuration achieved full annotation recovery of the union reference. Integration across tools and models maximized coverage and reduced annotation loss. In a case study of 204 colorectal cancer-associated SNPs from the FIGI GWAS, pathway enrichment results varied depending on the annotation strategy: the fully integrated approach identified all four significant pathways, whereas several single-tool or single-model strategies missed one or more. Conclusion: SNP annotation outcomes are influenced by both the tool and the gene model used, and relying on a single approach may result in incomplete coverage. A multi-tool, multi-model strategy provides the most comprehensive annotation and preserves enriched pathways, supporting more robust and reproducible genomic interpretation.
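The integration idea reduces to simple set algebra: a SNP is covered by the integrated approach if any tool annotates it, and in the consensus only if all do. The tool names below match the study; the toy SNP identifiers are invented for illustration:

```python
# Sketch of multi-tool annotation integration as set operations.
# Tool names are from the study; the toy SNP IDs are illustrative only.

annotations = {
    "ANNOVAR": {"rs1", "rs2"},
    "SnpEff":  {"rs1", "rs2", "rs3"},
    "VEP":     {"rs2", "rs4"},
}

union = set().union(*annotations.values())           # integrated coverage
consensus = set.intersection(*annotations.values())  # annotated by all tools

# Fraction of the union reference each tool recovers on its own.
coverage = {tool: len(s) / len(union) for tool, s in annotations.items()}
```

Even in this toy example no single tool recovers the full union, which is the pattern the genome-wide benchmark reports.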

8
DPGT: A spark based high-performance joint variant calling tool for large cohort sequencing

Gong, C.; Yang, Q.; Wan, R.; Li, S.; Zhang, Y.; Li, Y.

2026-03-05 bioinformatics 10.64898/2026.03.02.709184 medRxiv
Top 0.1%
23.5%

Background: Joint variant calling is a crucial step in population-scale sequencing analysis. While population-scale sequencing is a powerful tool for genetic studies, achieving fast and accurate joint variant calling on large cohorts remains computationally challenging. Findings: To meet this challenge, we developed the Distributed Population Genetics Tool (DPGT), an efficient computing framework and robust tool for joint variant calling on large cohorts based on Apache Spark. DPGT reduces joint calling on large cohorts to a single command on a local computer or a computing cluster, eliminating the need for users to create complex parallel workflows. We evaluated the performance of DPGT against existing methods using 2,504 1000 Genomes Project (1KGP) samples, 6 Genome in a Bottle (GIAB) samples, and 9,158 internal whole-genome sequencing (WGS) samples. DPGT produced results comparable in accuracy to existing methods, in less time and with better scalability. Conclusions: DPGT is a fast, scalable, and accurate tool for joint variant calling. The source code, implemented in Java and C++, is available under a GPLv3 license at https://github.com/BGI-flexlab/DPGT.

9
BCAR: A fast and general barcode-sequence mapper for correcting sequencing errors

Andrews, B.; Ranganathan, R.

2026-03-31 bioinformatics 10.64898/2026.03.27.714882 medRxiv
Top 0.1%
23.5%

Motivation: DNA barcodes are commonly used to distinguish genuine mutations from sequencing errors in sequencing-based assays. In the presence of indel errors, utilizing barcodes requires accurate alignment of the raw reads to distinguish genuine indels from indel errors. Existing strategies generally rely on aligners built for homology comparison and do not fully utilize quality scores. We reasoned that an aligner purpose-built for error correction could yield higher-quality barcode-sequence maps. Results: Here, we present BCAR, a fast barcode-sequence mapper for correcting sequencing errors. BCAR considers all of the evidence for each base call at each position, both during alignment and during final consensus generation. BCAR creates high-accuracy barcode-sequence maps from simulated reads across a broad range of error rates and read lengths, outperforming existing methods. We apply BCAR to two experimental datasets, where it generates high-quality barcode-sequence maps. Availability and implementation: BCAR source code, documentation, and test data are available from https://github.com/dry-brews/BCAR
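The consensus-generation half of the idea, "weigh every base call by its quality evidence", can be sketched for the simple case of already-aligned, equal-length reads. BCAR's actual algorithm also handles indels during alignment; this sketch, with our own function names, covers only the column-wise consensus:

```python
# Sketch of quality-aware consensus calling: at each column, sum Phred
# quality over the observed bases and keep the best-supported base.
# Assumes pre-aligned equal-length reads; BCAR's real method is richer.

def consensus(reads: list[str], quals: list[list[int]]) -> str:
    """Column-wise consensus of equal-length reads, weighted by quality."""
    out = []
    for i in range(len(reads[0])):
        weight: dict[str, int] = {}
        for read, qual in zip(reads, quals):
            weight[read[i]] = weight.get(read[i], 0) + qual[i]
        out.append(max(weight, key=weight.get))
    return "".join(out)

reads = ["ACGT", "ACAT", "ACGT"]
quals = [[30, 30, 30, 30], [30, 30, 10, 30], [30, 30, 30, 30]]
seq = consensus(reads, quals)
```

The low-quality disagreeing base (quality 10 at position 3) is outvoted, so the consensus recovers "ACGT".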

10
Sequence-to-graph alignment based copy number calling using a network flow formulation

Magalhaes, H.; Weber, J.; Klau, G. W.; Marschall, T.; Prodanov, T.

2026-02-24 bioinformatics 10.1101/2025.11.21.689771 medRxiv
Top 0.1%
23.5%

Variation of sequence copy number (CN) between individuals can be associated with phenotypic differences. Consequently, CN calling is an important step for disease association and identification, as well as for genome assembly validation. Traditionally, CN calling is done by mapping sequencing reads to a linear reference genome and estimating the CN from the observed read depth. This approach, however, is significantly hampered by sequences and rearrangements not present in a linear reference genome; at the same time, simple CN prediction for individual graph nodes does not make use of the graph topology and can lead to inconsistent results. To address these issues, we propose Floco, a method for CN calling with respect to a genome graph using a network flow formulation. Given a graph and alignments against that graph, we calculate raw CN probabilities for every graph node based on the negative binomial distribution and the base-pair coverage across the node, and then use integer linear programming to compute the CN flow through the whole graph. We tested this approach on 15 aligned datasets involving three different graphs, as well as HiFi and ONT sequencing reads and linear assemblies split into reads. The results demonstrate that the network flow formulation increases the accuracy of CN predictions by up to 43% compared with read-depth-based estimation alone. Additionally, concordance between predictions from the three different sequence sources reached 93.2%. Floco fills a gap in CN calling tools specifically designed for genome graphs.

11
Assessing the impact of parental linear gene normalization on the performance of statistical models for circular RNA differential expression analysis

Qorri, E.; Varga, V.; Priskin, K.; Latinovics, D.; Takacs, B.; Pekker, E.; Jaksa, G.; Csanyi, B.; Torday, L.; Bassam, A.; Kahan, Z.; Pinter, L.; Haracska, L.

2026-03-09 bioinformatics 10.64898/2026.03.06.710045 medRxiv
Top 0.2%
22.9%

Background: Circular RNAs (circRNAs) have emerged as promising non-invasive cancer biomarkers due to their stability, abundance in body fluids, and regulatory potential. However, circRNA differential expression analysis (DEA) remains challenging, largely owing to a lack of consensus on important preprocessing strategies such as filtering and normalization. While well-established bulk RNA-sequencing frameworks are commonly applied to circRNA data, newer approaches such as CIRI-DE (part of the CIRI3 suite) integrate both linear and circular transcript information to improve detection. Despite these developments, an assessment of such integrative strategies is lacking, and the critical impact of filtering on DEA model performance has not been comprehensively evaluated. Results: In this study, we evaluated the impact of multiple normalization and filtering strategies on circRNA DEA using five experimental datasets, including two in-house blood platelet sets and semi-parametric simulated in silico datasets. Our results emphasize the importance of selecting an appropriate filtering threshold, as overly lenient filtering substantially reduced model performance across datasets. We found edgeR's filterByExpr() strategy particularly effective at handling zero counts in circRNA data, while also generating the most reliable results across most datasets. Furthermore, by incorporating linear and circular information as described in CIRI-DE, most methods identified a higher number of differentially expressed (DE) circRNAs than with circular counts alone. Notably, circRNAs identified by both CIRI-DE and the modified bulk RNA-sequencing pipelines showed substantial overlap. Conclusion: Our findings demonstrate that automated filtering combined with linear-aware normalization significantly enhances the sensitivity and reproducibility of circRNA DEA, providing a standardized framework for more reliable biomarker discovery in transcriptomic research.

12
DEPower: approximate power analysis with DESeq2

Gorin, G.; Guruge, D.; Goodman, L.

2026-02-09 bioinformatics 10.64898/2026.02.05.704084 medRxiv
Top 0.2%
22.5%

Rigorous experimental design, including formal power analysis, is a cornerstone of reproducible RNA sequencing (RNA-seq) research. Designing an RNA-seq experiment requires computing the minimum number of samples needed to identify an effect of a particular size at a predefined significance level. Ideally, the statistical test used for the analysis of experimental data should match the test used for sample size determination; however, few tools adopt the assumptions of the popular differential expression testing framework DESeq2, and most opt for simulation-based rather than analytical approaches. Grounded in the DESeq2 model framework, we derive sample size requirements for both single-cell and bulk RNA-seq experiments, delivered as a web-based power analysis tool, DEPower (https://poweranalysis-fb.streamlit.app/), that makes rigorous RNA-seq study design accessible to all researchers.
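As a point of orientation for readers new to power analysis: the classical textbook calculation for a two-group mean comparison, under a normal approximation, looks like the sketch below. DEPower itself works under the DESeq2 negative binomial model, which this simple formula does not capture; the function and defaults here are our own illustration:

```python
# Textbook normal-approximation sample size for a two-group comparison.
# Illustrative only; not DEPower's DESeq2-based derivation.
import math
from statistics import NormalDist

def n_per_group(effect: float, sd: float,
                alpha: float = 0.05, power: float = 0.8) -> int:
    """Samples per group to detect a mean difference `effect` given noise `sd`."""
    z = NormalDist()
    za = z.inv_cdf(1 - alpha / 2)   # two-sided significance quantile
    zb = z.inv_cdf(power)           # power quantile
    return math.ceil(2 * (za + zb) ** 2 * (sd / effect) ** 2)

# The classic result: ~16 per group for a 1-SD effect at alpha=0.05, 80% power.
n = n_per_group(effect=1.0, sd=1.0)
```

The point of a tool like DEPower is that count data violate the constant-variance assumption baked into this formula, so matching the power calculation to the actual DESeq2 test matters.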

13
Calibration improves estimation of linkage disequilibrium on low sample sizes

Bercovich Szulmajster, U.; Wiuf, C.; Albrechtsen, A.

2026-03-07 bioinformatics 10.64898/2026.03.05.709321 medRxiv
Top 0.2%
22.4%

Linkage disequilibrium (LD) is a central statistic in population genetic studies, commonly measured by the squared correlation between pairs of genetic variants. An important drawback of this measure is its upward bias at finite sample sizes. Different methods exist that correct for this sample-size bias; however, because the correlation is a ratio, there is no unbiased way to compute it. In this work, we present a procedure to calibrate these methods using a non-parametric approach with simulated data. Forward modeling generates genotype matrices with known parameters, followed by an inverse mapping to recover estimates of the underlying parameters; a mean-centering calibration is then applied to the recovered estimates. Applied to real and simulated data, this approach shows consistent improvement in accuracy compared to other sample-size-aware methods. Furthermore, to study the effects on downstream analyses, we analyze the classification performance of LD pruning, where we also observe an improvement, particularly in extreme cases with sample sizes as low as 5 or 10 individuals.
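The upward bias being corrected is easy to demonstrate: for two truly independent biallelic loci the population r² is 0, yet the expected sample r² is roughly 1/n. The pure-Python simulation below shows the effect shrinking with sample size; it illustrates the bias only, not the calibration procedure of the paper:

```python
# Demonstration of the small-sample upward bias of r^2 between two
# INDEPENDENT binary loci (true r^2 = 0). Illustrative simulation only.
import random

def r2(x, y):
    """Squared Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / n
    vx = sum((a - mx) ** 2 for a in x) / n
    vy = sum((b - my) ** 2 for b in y) / n
    return cov * cov / (vx * vy)

def mean_r2(n_samples, reps=2000, seed=1):
    rng = random.Random(seed)
    total, used = 0.0, 0
    for _ in range(reps):
        x = [rng.randint(0, 1) for _ in range(n_samples)]
        y = [rng.randint(0, 1) for _ in range(n_samples)]
        if len(set(x)) > 1 and len(set(y)) > 1:  # skip monomorphic draws
            total += r2(x, y)
            used += 1
    return total / used

bias_small = mean_r2(10)   # near 1/10 despite true r^2 = 0
bias_large = mean_r2(100)  # near 1/100
```

At n = 10 the average estimated r² is an order of magnitude larger than at n = 100, which is exactly the regime where the paper's calibration helps most.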

14
orthogene: a Bioconductor package to easily map genes within and across hundreds of species

Schilder, B. M.; Skene, N. G.; Murphy, A. E.

2026-01-21 bioinformatics 10.64898/2026.01.17.700094 medRxiv
Top 0.2%
22.3%

Motivation: Mapping genes across identifier systems and species is a routine but critical step in bioinformatics workflows. Despite its ubiquity, gene mapping is frequently handled with bespoke, ad hoc solutions, duplicating effort and introducing opportunities for error. These issues are exacerbated by the prevalence of non-one-to-one homolog relationships and inconsistent handling of gene identifiers across species and databases, which can compromise downstream analyses and reproducibility. Results: We present orthogene, an R/Bioconductor package that simplifies gene mapping within and across hundreds of species. orthogene provides a unified, workflow-oriented framework that integrates automated species and identifier standardization, homolog inference across multiple databases, flexible handling of ambiguous homolog relationships, and transformation of gene lists, tables, and high-dimensional matrices into analysis-ready formats. By abstracting common sources of technical complexity while retaining user control, orthogene enables transparent, reproducible, and scalable gene mapping across a wide range of biological contexts. Availability: https://bioconductor.org/packages/orthogene Contact: brian_schilder@alumni.brown.edu

15
HORDCOIN: A Software Library for Higher Order Connected Information and Entropic Constraints Approximation

Raffaelli, G. T.; Kislinger, J.; Kroupa, T.; Hlinka, J.

2026-02-10 bioinformatics 10.64898/2026.02.08.704639 medRxiv
Top 0.2%
22.2%

Background and objective: Quantifying higher-order statistical dependencies in multivariate biomedical data is essential for understanding collective dynamics in complex systems such as neuronal populations. The connected information framework provides a principled decomposition of the total information content into contributions from interactions of increasing order. However, its application has been limited by the computational complexity of conventional maximum entropy formulations. In this work, we present a generalised formulation of connected information based on maximum entropy problems constrained by entropic quantities. Methods: The entropic-constraint approach, in contrast to the original constraints based on marginals or moments, transforms the original nonconvex optimisation into a tractable linear program defined over polymatroid cones. This simplification enables efficient, robust estimation even under undersampling conditions. Results: We present the theoretical foundations, algorithmic implementation, and validation through numerical experiments and real-world data. Applications to symbolic sequences, large-scale neuronal recordings, and DNA sequences demonstrate that the proposed method accurately detects higher-order interactions and remains stable even with limited data. Conclusions: The accompanying open-source software library, HORDCOIN (Higher ORDer COnnected INformation), provides user-friendly tools for computing connected information using both marginal- and entropy-based formulations. Overall, this work bridges the gap between abstract information-theoretic measures and practical biomedical data analysis, enabling scalable investigation of higher-order dependencies in neurophysiological and other complex biological systems such as the genome.
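For intuition about the quantity being decomposed: connected information splits the total statistical dependency of a system, the total correlation (sum of marginal entropies minus the joint entropy), into per-order contributions. Computing the total from an empirical joint distribution is straightforward; the sketch below does only that, not HORDCOIN's per-order decomposition:

```python
# Total correlation (multi-information): sum of marginal entropies minus
# the joint entropy. This is the total that connected information splits
# into per-order contributions; the decomposition itself is not shown.
import math
from collections import Counter

def entropy(counts):
    """Shannon entropy in bits from a list of event counts."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def total_correlation(samples):
    """samples: list of equal-length tuples of discrete symbols."""
    joint = Counter(samples)
    h_joint = entropy(list(joint.values()))
    h_marginals = sum(
        entropy(list(Counter(s[i] for s in samples).values()))
        for i in range(len(samples[0]))
    )
    return h_marginals - h_joint

# Two perfectly coupled binary variables: TC = H(X) + H(Y) - H(X,Y) = 1 bit.
tc = total_correlation([(0, 0), (1, 1), (0, 0), (1, 1)])
```

The hard part, which HORDCOIN addresses, is attributing this total to pairwise versus genuinely higher-order interactions, which classically requires solving maximum entropy problems.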

16
Benchmarking within-sample minority variant detection with short-read sequencing in M. tuberculosis

Mulaudzi, S.; Kulkarni, S.; Marin, M. G.; Farhat, M. R.

2026-02-16 bioinformatics 10.64898/2026.02.13.704885 medRxiv
Top 0.2%
22.2%

Background: Low-frequency (minority) variants, i.e. variants detectable within a sample at low allele frequencies, are relevant in several areas of research and health, ranging from cancer to pathogen heteroresistance. There is uncertainty around the optimal bioinformatic approach to accurately and reproducibly distinguish low-frequency variants from sequencing or mapping error. To address this, we benchmarked seven variant callers on precision, recall, and false positive characteristics for detecting low-frequency variants, using simulated short-read whole genome sequencing data for 700 Mycobacterium tuberculosis strains. We also developed a new low-frequency error model for filtering the output of the best-performing tool using read mapping and quality metrics. Results: We simulated 378 unique variants across 5 genomic backgrounds spanning 4 lineages. Variants were simulated to represent 3 genomic region categories, 10 allele frequencies, and 5 sequencing depths. FreeBayes, a haplotype-based variant caller, achieved the highest pooled F1 score of the seven tools in drug resistance regions (average F1 = 0.86), and its higher performance held across genomic context and background. Across tools, we identified lower performance in repetitive (low-mappability) regions and strong reference bias in low-frequency variant calling. We validated variant caller performance on a sample of in vitro strain mixtures, substantiating our ranking. When paired with FreeBayes, the error model excludes 49% of false variants and <1% of true variants. Conclusions: Our analysis provides evidence to support best practices for low-frequency variant calling, including tool choice, masking, and filtering. We also provide a new error model that excludes false positive low-frequency variant calls from FreeBayes output.

17
PanACRpred: Predicting Accessible Chromatin Regions in Pangenomes using Motif Chaining

Warr, M. J.; Dinh, T.; Root, B.; Onstott, E.; Yu, K.; Mudge, J.; Ramaraj, T.; Kahanda, I.; Mumey, B.

2026-02-06 bioinformatics 10.64898/2026.02.05.703812 medRxiv
Top 0.2%
21.9%

In this work, we investigate using motif subsequence features to predict whether a genomic region is accessible to regulatory proteins, i.e. an accessible chromatin region (ACR), enabling transcription of associated genes. We focus on plants, whose agricultural and ecological importance makes them compelling organisms to study and whose complex genomes provide important stress tests for our algorithm. We show that motif sequence similarity, as found by co-linear chaining, can be used in combination with machine learning models to effectively predict ACRs in genome assemblies.
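Co-linear chaining, the core primitive the abstract names, selects the largest set of match "anchors" that appear in the same order in both sequences. A minimal O(n²) dynamic program conveys the idea; real chaining implementations (and PanACRpred's own, which also scores motif similarity) are more elaborate:

```python
# Sketch of co-linear chaining: given motif match "anchors" as
# (query_pos, ref_pos) pairs, find the largest chain increasing in BOTH
# coordinates. O(n^2) DP for clarity; not PanACRpred's implementation.

def best_chain(anchors):
    """Return the size of the largest co-linear subset of anchors."""
    anchors = sorted(anchors)
    best = [1] * len(anchors)
    for j in range(len(anchors)):
        for i in range(j):
            if anchors[i][0] < anchors[j][0] and anchors[i][1] < anchors[j][1]:
                best[j] = max(best[j], best[i] + 1)
    return max(best, default=0)

# Three anchors chain co-linearly; (5, 2) breaks the order and is skipped.
chain_len = best_chain([(1, 1), (3, 4), (5, 2), (6, 7)])
```

The chain score between a candidate region and known ACRs then becomes a feature for a downstream classifier, which is the combination the abstract describes.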

18
BoolDog: integrated Boolean and semi-quantitative network modelling in Python

Bleker, C.; Zagorscak, M.; Blejec, A.; Gruden, K.; Zupanic, A.

2026-03-17 systems biology 10.64898/2026.03.16.711264 medRxiv
Top 0.2%
19.1%

Summary: Boolean and logic-based modeling approaches are well suited to the analysis of complex biological systems, particularly when detailed biochemical and kinetic information is unavailable. In such settings, biological pathways are represented as networks capturing system components and their interactions, providing a simplified yet informative abstraction of system behavior. While the structural topology of these networks is often well characterized, the absence of mechanistic detail limits the applicability of parameter-dependent modeling frameworks. To address this, we present BoolDog, a Python package for the construction, simulation, and analysis of Boolean and semi-quantitative Boolean networks. BoolDog supports synchronous simulation with events, attractor and steady-state identification, network visualization, and the systematic transformation of logic-based models into continuous ordinary differential equation (ODE) systems, enabling seamless integration of discrete and continuous modeling paradigms. Networks can be imported and exported across standard formats, and BoolDog integrates natively with established Python libraries for network analysis and visualization, including NetworkX, igraph, and py4Cytoscape. Together, these capabilities provide a flexible, accessible, and interoperable platform for logic-based modeling of complex biological systems. Availability and implementation: BoolDog is implemented in Python and available at https://github.com/NIB-SI/BoolDog/.
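The discrete half of what such a package wraps, synchronous Boolean simulation with attractor detection, fits in a few lines. The node names and update rules below are a toy network of our own, not BoolDog's API:

```python
# Minimal synchronous Boolean network: all nodes update simultaneously from
# the previous state; iteration until a state repeats finds an attractor.
# Toy rules and names; not BoolDog's interface.

rules = {
    "A": lambda s: s["C"],                  # A copies C
    "B": lambda s: s["A"] and not s["C"],   # B needs A without C
    "C": lambda s: not s["B"],              # B represses C
}

def step(state):
    return {node: bool(rule(state)) for node, rule in rules.items()}

def find_attractor(state, max_steps=64):
    """Iterate synchronously until a state repeats; return the cycle."""
    seen = []
    for _ in range(max_steps):
        key = tuple(sorted(state.items()))
        if key in seen:
            return seen[seen.index(key):]   # the attractor cycle
        seen.append(key)
        state = step(state)
    raise RuntimeError("no attractor found within max_steps")

cycle = find_attractor({"A": False, "B": False, "C": False})
```

This toy network settles into a fixed point (a cycle of length one); the semi-quantitative side of BoolDog then replaces such hard 0/1 updates with continuous ODE dynamics.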

19
A systematic assessment of machine learning for structural variant filtering

Kalra, A.; Paulin, L.; Sedlazeck, F.

2026-01-30 bioinformatics 10.64898/2026.01.27.702059 medRxiv
Top 0.3%
18.8%
Show abstract

Background: Accurate discrimination of true structural variants (SVs) from artifacts in long-read sequencing data remains a critical bottleneck. Numerous machine learning solutions have been proposed, ranging from classical models using engineered features to advanced deep learning and foundation model interpretability methods. However, a systematic comparison of their performance, efficiency, and practical utility is lacking. Results: We conducted a comprehensive benchmark of five machine learning paradigms for SV filtering using standardized Genome in a Bottle (GIAB) data for samples HG002 and HG005. We evaluated classical Random Forest classifiers on 15 genomic features, computer vision models (ResNet/VICReg), diffusion-based anomaly detection, sparse autoencoders (SAEs) on the Evo2-7B foundation model, and multimodal ensembles. A simple Random Forest on interpretable features achieved a peak F1-score of 95.7%, effectively matching all more complex models (ResNet50: 95.9%, Diffusion: 95.8%). This study represents the first application of diffusion-based anomaly detection and sparse autoencoders to structural variant analysis; while diffusion models learned highly discriminative, disentangled representations and SAEs uncovered biologically interpretable features (including atoms specific for Alu deletions, chromosome X variants, and insertion events), they did not significantly surpass this classification ceiling. Ensemble methods offered no performance benefit but may have future potential given the orthogonality of vision-based and linear features. Conclusions: Our findings demonstrate that for the established task of germline SV filtering, simpler, interpretable models provide an optimal balance of accuracy, speed, and transparency. This benchmark establishes a pragmatic framework for method selection and argues that increased model complexity must be justified by clear, unmet biological needs rather than marginal predictive gains.
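The "classical" paradigm that topped this benchmark, a Random Forest over engineered per-variant features, can be sketched in a few lines with scikit-learn. The feature names and synthetic labels below are hypothetical stand-ins for the study's 15 genomic features, not its actual data.

```python
# Sketch of Random Forest SV filtering on engineered features.
# Features and labels here are synthetic placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
# Toy per-SV features: e.g. read depth, mapping quality, allele fraction.
X = rng.normal(size=(n, 3))
# Synthetic ground truth: a variant is "real" when the support signal is strong.
y = (X[:, 0] + 0.5 * X[:, 2] > 0).astype(int)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
acc = clf.score(X, y)  # training accuracy on the toy data
```

One advantage the abstract highlights is interpretability: with a model like this, `clf.feature_importances_` directly ranks which genomic features drive the filtering decision, which deep models only expose through post-hoc tools such as SAEs.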

20
HP2NET: Empowering Efficient Phylogenetic Network Analysis through High-Performance Computing

Terra, R.; Carvalho, D.; Machado, D. J.; Osthoff, C.; Ocana, K.

2026-03-08 bioinformatics 10.64898/2026.03.05.709005 medRxiv
Top 0.3%
18.8%
Show abstract

Advances in High-Performance Computing (HPC) have enabled increasingly complex genomic analyses, including those in phylogenomics. These analyses contribute to understanding the evolution of viruses and pathogens, improving our knowledge of disease transmission, and supporting targeted public health strategies. However, due to the increasing number of tools and processing steps involved, executing these analyses manually, step by step, becomes error-prone and inefficient. To address this challenge, we present HP2NET, a robust framework for reproducible, efficient, and scalable phylogenetic network analysis. HP2NET integrates five workflows based on state-of-the-art tools such as PhyloNetworks and PhyloNet, allowing the analysis of multiple datasets and workflows in a single execution. The framework includes features such as task packaging and data reuse to improve performance and resource utilization in HPC environments. We perform a comprehensive performance evaluation of the software used within HP2NET, identifying bottlenecks and analyzing gains from parallel processing. In our experimental environment, data reuse reduced runtime by up to 15.35% for a small dataset, while parallel execution of the five pipelines reduced total runtime by up to 90.96% compared to sequential runs. Finally, we validate HP2NET in a real-world case study by analyzing Dengue virus genomes, demonstrating its value for large-scale phylogenetic analyses.
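The parallel-execution gain reported above comes from running independent workflows concurrently instead of one after another. A minimal scheduler for that pattern can be sketched with the standard library; the pipeline names below are hypothetical placeholders, and HP2NET's real workflows wrap external tools such as PhyloNetworks and PhyloNet rather than Python functions.

```python
# Run independent analysis pipelines concurrently instead of sequentially.
# Pipeline names are hypothetical; real workflows would invoke external tools.
from concurrent.futures import ThreadPoolExecutor

def run_pipeline(name):
    # Placeholder for launching one phylogenetic-network workflow
    # (e.g. via subprocess on an HPC node).
    return f"{name}: done"

pipelines = ["workflow_1", "workflow_2", "workflow_3", "workflow_4", "workflow_5"]

with ThreadPoolExecutor(max_workers=len(pipelines)) as pool:
    results = list(pool.map(run_pipeline, pipelines))
```

Since the five pipelines share no intermediate state, total wall-clock time approaches that of the slowest pipeline rather than the sum of all five, which is consistent with the ~91% runtime reduction the authors report for parallel versus sequential execution.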